Maine lakes data
Let’s try looking at some real data. Here we will load data on the number of freshwater fish species present in each of Maine’s 6,000 lakes. I scraped this data from a map produced by www.gulfofmaine.org and it is based on lake surveys, but I haven’t verified the monitoring protocol. In any case it will work well as an example dataset. The cleaned up data is available in the GitHub repository (link at end).
Fish counts
The first plot shows a histogram of the number of fish found in Maine lakes. The dataset includes about 2,000 lakes (quite a discrepancy from the 6,000 lakes that are supposedly found in Maine!).
In order to run the code below here using the fish data, you will need to retrieve the csv file from GitHub. The link to the repository is included at the end of this document.
# load the lake fish data
fish <- read.delim("maine_fish.txt")
# I'm filtering out some records of tiny lakes for simplicity
fish <- fish %>%
filter(acres > 2, depth_max > 2)
# Basic histogram of number of fish per lake
fh <- ggplot(fish, mapping = aes(x = fish)) +
geom_histogram(fill = "orange", binwidth = 1, color = "gray40") +
theme_brt() +
xlab("Number of fish species") +
ylab("Number of lakes")
#ggsave("fish_hist.png", fh, height = 4, width = 6, units = "in", device = "png")
fh
Lake area linear regression
First, we looked a plot of lake area (on the log base 2 scale) by fish diversity. We used a log transformation because the relationship is not linear on the measurement scale. Log 2 is convenient here because a one unit increase in lake area represents a doubling in lake size. The slides included two versions of this plot, with the second version having lake area centered to make the intercept easier to interpret.
# first I'm making a few new columns with area and depth log transformed and then centered
fish$log2_acres <- log(fish$acres, 2) # log transform base 2, 1-unit increase is a doubling in lake size
fish$log2_acres_c <- fish$log2_acres - mean(fish$log2_acres) # centers the log transformed acres
fish$log2_depth <- log(fish$depth_max, 2)
fish$log2_depth_c <- fish$log2_depth - mean(fish$log2_depth)
# Plot fish by area first with log2 area then with centered log2 area
pf1 <- ggplot(fish, mapping = aes(x = log2_acres, y = fish)) +
geom_point(alpha = 0.5, color = "slateblue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
theme_brt() +
xlab("Acres (log base 2)") +
ylab("Fish species")
#ggsave("fish_area.png", pf1, device = "png", width = 4, height = 3, units = "in")
pf1b <- ggplot(fish, mapping = aes(x = log2_acres_c, y = fish)) +
geom_point(alpha = 0.5, color = "slateblue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
theme_brt() +
xlab("Centered acres (log base 2)") +
ylab("Fish species") +
geom_vline(xintercept = 0, color = "red", linetype = "dashed")
#ggsave("fish_area_c.png", pf1b, device = "png", width = 4, height = 3, units = "in")
ggarrange(pf1, pf1b, nrow = 1)
Next, we fit a simple linear regression corresponding to each of the plots above and examined the summary of the fit model.
# simple linear model for lake area
mf <- lm(fish ~ log2_acres, data = fish)
# summary of estimates for this model
round(precis(mf), 2) # a more complete summary can be produced with summary(mf)
## mean sd 5.5% 94.5%
## (Intercept) -2.19 0.21 -2.53 -1.84
## log2_acres 1.69 0.03 1.64 1.74
# simple linear model for centered lake area
mfs <- lm(fish ~ log2_acres_c, data = fish)
# summary of estimates for this model
round(precis(mfs), 2) # a more complete summary can be produced with summary(mfs)
## mean sd 5.5% 94.5%
## (Intercept) 8.51 0.08 8.39 8.63
## log2_acres_c 1.69 0.03 1.64 1.74
Fish by lake depth
Repeating the figures for lake area above but this time using lake depth.
# Plot fish by depth first with log2 area then with centered log2 depth
pf2 <- ggplot(fish, mapping = aes(x = log2_depth, y = fish)) +
geom_point(alpha = 0.5, color = "slateblue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
theme_brt() +
xlab("Max depth in feet (log base 2)") +
ylab("Fish species")
#ggsave("fish_depth.png", pf2, device = "png", width = 4, height = 3, units = "in")
pf2b <- ggplot(fish, mapping = aes(x = log2_depth_c, y = fish)) +
geom_point(alpha = 0.5, color = "slateblue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
theme_brt() +
xlab("Centered max depth (log base 2)") +
ylab("Fish species") +
geom_vline(xintercept = 0, color = "red", linetype = "dashed")
#ggsave("fish_depth_c.png", pf2b, device = "png", width = 4, height = 3, units = "in")
ggarrange(pf2, pf2b, nrow = 1)
And then repeating the simple linear models to correspond with each lake depth figure.
# simple linear model for lake depth
md <- lm(fish ~ log2_depth, data = fish)
# summary of estimates for this model
round(precis(md), 2) # a more complete summary can be produced with summary(md)
## mean sd 5.5% 94.5%
## (Intercept) -1.48 0.37 -2.08 -0.89
## log2_depth 2.26 0.08 2.13 2.39
# simple linear model for centered lake depth
mds <- lm(fish ~ log2_depth_c, data = fish)
# summary of estimates for this model
round(precis(mds), 2) # a more complete summary can be produced with summary(mds)
## mean sd 5.5% 94.5%
## (Intercept) 8.51 0.10 8.34 8.67
## log2_depth_c 2.26 0.08 2.13 2.39
Multiple linear regression
We talked about how simple linear regression can be expanded into multiple linear regression with more than one predictor. Both lake area and lake depth are positively associated with fish diversity, so we fit a multiple linear regression with both measures as predictors.
# multiple linear regression for area and depth
mm <- lm(fish ~ log2_acres_c + log2_depth_c, data = fish)
# print output for this model
round(precis(mm), 2) # a complete summary can be produced with summary(mm)
## mean sd 5.5% 94.5%
## (Intercept) 8.51 0.08 8.39 8.63
## log2_acres_c 1.50 0.04 1.44 1.55
## log2_depth_c 0.67 0.07 0.56 0.78
Area by depth
Once we know lake area, adding lake depth in a multiple regression doesn’t tell us much more information about fish diversity. This is probably because lake area and depth are themselves pretty closely correlated, such that the same information about fish diversity is captured by both predictors.
# Plot lake area by depth
pf3 <- ggplot(fish, mapping = aes(x = log2_acres_c, y = log2_depth_c)) +
geom_point(alpha = 0.5, color = "slateblue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
theme_brt() +
xlab("Centered acres (log base 2)") +
ylab("Centered depth \n (log base 2)")
#ggsave("depth_size.png", pf3, device = "png", width = 4, height = 3, units = "in")
pf3